GitHub Repo

Step 1 recap

My data comes from the Mathematics Genealogy Project at NDSU, a database of people who have received a doctorate (or similar degree) in mathematics. The website includes these data points for a given individual:

- Their complete name
- Their degrees/dissertations; for each university/dissertation:
  - The name of the university
  - The year(s) in which their degree was awarded
  - The complete title of the dissertation
  - The complete names of their advisors
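For concreteness, one row of the resulting data might look roughly like this (the `id`, `name`, and `advisors` values come from output later in this post; the other values are placeholders, and the inner schools tibble has additional columns not shown):

```r
# Illustrative row -- "…" values are placeholders, not real data,
# and the schools tibble has more columns than shown here.
tibble::tibble(
  id       = 1L,
  name     = "Ernest Willard Anderson",
  schools  = list(tibble::tibble(
    thesis     = "…",
    university = "…",
    countries  = list("…")
  )),
  advisors = list(258L)
)
```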

I scraped all the pages on the site (253,772 of them) and then converted them into a dataframe (tibble).
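The scraper itself isn't shown here, but its core could be sketched with rvest like this (the URL pattern matches the site; the selector and fields are simplified assumptions, not the actual step 1 code):

```r
library(rvest)

# Simplified sketch: fetch one genealogy page and pull the person's name.
# The "h2" selector is a guess; the real scraper handled many edge cases.
scrape_person <- function(id) {
  url  <- paste0("https://www.mathgenealogy.org/id.php?id=", id)
  page <- read_html(url)
  tibble::tibble(
    id   = id,
    name = html_text2(html_element(page, "h2"))
  )
}
```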

Let’s import that now:

site.data <- readRDS("sitedata.Rda")
format(object.size(site.data), units="Mb")
## [1] "653.3 Mb"

That process included a good amount of data cleaning in itself, as I had to handle pages with odd data that didn't follow what seemed to be the general format. Here are a couple of examples of this from step 1:

While working on this part I encountered rare instances of odd data. For example, some people have no country flag, or just one flag despite attending multiple universities. I also found a case where a single dissertation has two flags. Because of this I had to switch from a "country" field in the inner schools tibble to a "countries" field that is a list.

Another extremely rare case that was difficult to figure out was the text "Ph.D. advisor:" instead of the normal "Advisor" where the advisor links are. In another instance it was "Supervisor", and on one person's page it is "Doctoral advisor". I eventually generalized the xpath for the advisor p tag using its style, while keeping the search for "Advisor" because some pages with no advisor don't use the same style for that p tag.
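As a rough illustration of that approach (the style fragment below is a guess, not the actual xpath from step 1), matching the paragraph by its style attribute while still falling back to the literal text might look like:

```r
library(rvest)

page <- read_html("https://www.mathgenealogy.org/id.php?id=1")

# Match the advisor <p> either by its style attribute or by its text.
# 'line-height' is a guessed fragment of the real style value.
advisor_p <- html_elements(
  page,
  xpath = paste0(
    "//p[contains(@style, 'line-height')]",
    " | //p[contains(text(), 'Advisor')]"
  )
)
```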

Returning to the case of multiple flags, we can count how many such instances exist.

allschools <- map_dfr(site.data$schools, ~ .)
allschools[map_lgl(allschools$countries, ~ length(.) > 1),c(1,2,4)]
## # A tibble: 1,377 x 3
##    thesis                           university                         countries
##    <chr>                            <chr>                              <list>   
##  1 On the Kinetic Theory of Steady~ New York University and The Johns~ <list [2~
##  2 (First two degrees were Law deg~ Universiteit Leiden and Universit~ <list [3~
##  3 Lattice Coverings of n-Dimensio~ Tulane University and Uniwersytet~ <list [2~
##  4 Homotopy Associativity of H-Spa~ Princeton University and Universi~ <list [2~
##  5 Some Results in The Geometry of~ Panjab University, Chandigarh and~ <list [2~
##  6 On the Theory of Uniform Distri~ Eötvös Loránd University and Univ~ <list [2~
##  7 ""                               Kazan State University and Lomono~ <list [2~
##  8 "1. Perturbation and Stability ~ Clarkson University and Nagoya Un~ <list [2~
##  9 On The Relation Between Continu~ University of Toronto and Brandon~ <list [2~
## 10 3-Dimensional Bol Loops Corresp~ Friedrich-Alexander-Universität E~ <list [2~
## # ... with 1,367 more rows

The second row is an interesting example: three flags, for three universities.

More data analysis

Let's look at some trends and graphs.

Number of advisors and students

Let’s look at the number of advisors and students people have.

advisorCounts <- map_int(site.data$advisors, length)

summary(advisorCounts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   1.082   1.000   7.000
plot_ly(y=table(advisorCounts), type="bar") %>% layout(title="Frequency of Advisor Count", xaxis=list(title="Number of Advisors"), yaxis=list(title="Frequency"))
studentCounts <- map_int(site.data$students.id, length)

summary(studentCounts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.082   0.000 150.000
plot_ly(y=table(studentCounts)[0:8], type="bar") %>% layout(title="Frequency of Student Count (< 8 students)", xaxis=list(title="Number of Students"), yaxis=list(title="Frequency"))

I cut off the x-axis at 7, but some people have many more students than that. Let's look at those with more than 7 students.

plot_ly(y=table(studentCounts)[8:max(studentCounts)], type="bar") %>% layout(title="Frequency of Student Count (> 7 students)", xaxis=list(title="Number of Students"), yaxis=list(title="Frequency"))

Dissertations by Country

filterfreq <- 500
allcountries <- unlist(allschools$countries)
countrycounts <- plyr::count(allcountries)
filteredcountrycounts <- filter(countrycounts, freq >= filterfreq)
filteredcountrycounts$x <- as.character(filteredcountrycounts$x)
filteredcountrycounts <- rbind(filteredcountrycounts, list("Other", sum(countrycounts$freq[countrycounts$freq < filterfreq])))

(ggplot(filteredcountrycounts, aes(x, freq)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle=45,hjust=1)) + xlab("Country") + ggtitle("Dissertations by Country")) %>% ggplotly()

It looks like the majority of people in the database are from (or studied in) the United States, with another good portion from Germany, France, and the U.K.

Let's look at the same data on a map.

library(countrycode)
countrycountsiso <- countrycounts
countrycountsiso$c <- countrycode(countrycountsiso$x, "country.name", "iso3c")
nacount <- sum(countrycounts$freq[is.na(countrycountsiso$c)])
#countrycountsiso <- countrycountsiso[!is.na(countrycountsiso$c),]
plot_geo() %>%
  add_trace(
    z = countrycountsiso$freq, color = countrycountsiso$freq, colors = 'Blues',
    text = countrycountsiso$x, locations = countrycountsiso$c, marker = list(line = list(color = toRGB("grey"), width = 0.5))) %>%
  colorbar(title = 'Number of Dissertations', type="log") %>%
  layout(
    title = 'Dissertations by Country Map',
    geo = list(
      showcountries = TRUE,
      countrycolor = toRGB("light grey"),
      showframe = FALSE,
      showcoastlines = FALSE,
      projection = list(type = 'Mercator')
    )
  )

Bringing it into a graph library

The next step is to bring the data into some graph database or library. The first step to do so is to get the nodes and edges figured out.

I will create a new data table of just the IDs and names of everyone to be the nodes, and a data table of connections to be the edges.

nodes <- site.data[c("id", "name")]

There will be two types of edges: those representing that the from ID is a student of the to ID, and those representing that the from ID is an advisor to the to ID.

library(furrr) # provides future_map_dfr
library(tidyr) # provides unnest

# Split a dataframe into chunks of n rows so each can be processed separately
splitdf <- function(df, n) {
  split(df, (seq(nrow(df)) - 1) %/% n)
}

edges.studentOf <- future_map_dfr(splitdf(site.data[,c("id", "advisors")], 10000), unnest, cols = "advisors", .progress = TRUE)
edges.studentOf$type <- "studentOf"
edges.studentOf$advisors <- unlist(edges.studentOf$advisors) # advisors field are all lists of 1 from unnest
colnames(edges.studentOf) <- c("from", "to", "type")

saveRDS(edges.studentOf, "edges-student-of.Rda")

edges.advisorTo <- future_map_dfr(splitdf(site.data[,c("id", "students.id")], 10000), unnest, cols = "students.id", .progress = TRUE)
edges.advisorTo$type <- "advisorTo"
edges.advisorTo$students.id <- unlist(edges.advisorTo$students.id) # students.id field are all lists of 1 from unnest
colnames(edges.advisorTo) <- c("from", "to", "type")

saveRDS(edges.advisorTo, "edges-advisor-to.Rda")

edges <- rbind(edges.studentOf, edges.advisorTo)

saveRDS(edges, "alledges.Rda")

Running the operation in parallel really helped with RAM usage and made it fairly quick.
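Note that the future_map_dfr calls above only run in parallel once a plan has been set with the future package; a minimal setup (the worker count here is illustrative) would be:

```r
library(future)
library(furrr)

# Use separate R sessions as workers; adjust the count to your machine.
plan(multisession, workers = 4)
```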

Now let's look at what was created. We should have a simple dataframe with from, to, and type columns.

edges <- readRDS("alledges.Rda")
edges
## # A tibble: 539,497 x 3
##     from    to type     
##    <int> <int> <chr>    
##  1     1   258 studentOf
##  2     2   258 studentOf
##  3     3   258 studentOf
##  4     4   239 studentOf
##  5     5   258 studentOf
##  6     6   258 studentOf
##  7     7   281 studentOf
##  8     8   258 studentOf
##  9     9   281 studentOf
## 10    10   258 studentOf
## # ... with 539,487 more rows

Now let’s bring that into the igraph library.

library(igraph)

net <- graph_from_data_frame(edges, vertices = nodes, directed = TRUE)
net
## IGRAPH f3bb365 DN-- 249371 539497 -- 
## + attr: name (v/c), type (e/c)
## + edges from f3bb365 (vertex names):
##  [1] Ernest Willard Anderson->Dio Lewis Holl         
##  [2] Archie  Higdon         ->Dio Lewis Holl         
##  [3] Donald Hill Rock       ->Dio Lewis Holl         
##  [4] Charles Joseph Thorne  ->John Vincent Atanasoff 
##  [5] Ralph Harry Tripp      ->Dio Lewis Holl         
##  [6] William B. Stiles      ->Dio Lewis Holl         
##  [7] Carl Eric Langenhop    ->Henry Peter Thielman   
##  [8] James W. Beach         ->Dio Lewis Holl         
## + ... omitted several edges
format(object.size(net), units="Mb")
## [1] "43.4 Mb"

We can also make one with the full set of data for the nodes, but it will be a larger structure.

net2 <- graph_from_data_frame(edges, vertices = site.data, directed = TRUE)
net2
## IGRAPH f5e912b DN-- 249371 539497 -- 
## + attr: name (v/c), schools (v/x), advisors (v/x), students.raw (v/x),
## | students.id (v/x), type (e/c)
## + edges from f5e912b (vertex names):
##  [1] Ernest Willard Anderson->Dio Lewis Holl        
##  [2] Archie  Higdon         ->Dio Lewis Holl        
##  [3] Donald Hill Rock       ->Dio Lewis Holl        
##  [4] Charles Joseph Thorne  ->John Vincent Atanasoff
##  [5] Ralph Harry Tripp      ->Dio Lewis Holl        
##  [6] William B. Stiles      ->Dio Lewis Holl        
##  [7] Carl Eric Langenhop    ->Henry Peter Thielman  
## + ... omitted several edges
format(object.size(net2), units="Mb")
## [1] "676.8 Mb"

So now we have it in igraph. This is somewhat of a sanity check in itself, since everything imported fine.
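A slightly more explicit sanity check (just a sketch) is to confirm that the graph's node and edge counts match the source tables:

```r
# Both should hold if the import preserved everything.
stopifnot(vcount(net) == nrow(nodes))
stopifnot(ecount(net) == nrow(edges))
```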

Let’s chart some parts of the graph.

Let’s chart the first 10 edges (and their nodes) with the visNetwork library (they are interactive and can be panned and zoomed).

edge_attr(net, "label") <- edge_attr(net, "type") # set label to type so we see the edge type

library(visNetwork)

visIgraph(subgraph.edges(net, eids=1:10))

Now for a larger example, the first 200 edges.

visIgraph(subgraph.edges(net, eids=1:200))

Let’s look at the neighborhood (order 3) of a specific person, that is, those within 3 degrees of separation.

induced_subgraph(net, unlist(ego(net, nodes = 54719, order = 3))) %>% visIgraph()

What’s next

We now run into the issue that none of the various graph libraries for R handle such a large graph well (most of the things I attempt to do with igraph seem to take forever or crash R). So in step 3 I will try to get the data into a dedicated graph database, potentially Neo4j, and do further analysis and exploration from there.